245 research outputs found

    Unleashing Fine-Grained Parallelism on Embedded Many-Core Accelerators with Lightweight OpenMP Tasking

    In recent years, programmable many-core accelerators (PMCAs) have been introduced in embedded systems to satisfy stringent performance/Watt requirements. This has increased the need for programming models capable of effectively leveraging hundreds to thousands of processors. Task-based parallelism has the potential to provide such capabilities, offering high-level abstractions to outline abundant and irregular parallelism in embedded applications. However, efficiently supporting this programming paradigm on embedded PMCAs is challenging, due to the large time and space overheads it introduces. In this paper we describe a lightweight OpenMP tasking runtime environment (RTE) design for a state-of-the-art embedded PMCA, the Kalray MPPA 256. We provide an exhaustive characterization of the costs of our RTE, considering both synthetic workloads and real programs, and we compare it to several other tasking RTEs. Experimental results confirm that our solution achieves near-ideal parallelization speedups for tasks as small as 5K cycles, and an average speedup of 12× for real benchmarks, which is 60% higher than what we observe with the original Kalray OpenMP implementation.

    Optimization Techniques for Parallel Programming of Embedded Many-Core Computing Platforms

    Nowadays, many-core computing platforms are widely adopted as a viable solution to accelerate compute-intensive workloads at different scales, from low-cost devices to HPC nodes. It is well established that heterogeneous platforms including a general-purpose host processor and a parallel programmable accelerator have the potential to dramatically increase the peak performance/Watt of computing architectures. However, the adoption of these platforms further complicates application development, while it is widely acknowledged that software development is a critical activity in platform design. The introduction of parallel architectures raises the need for programming paradigms capable of effectively leveraging an increasing number of processors, from two to thousands. In this scenario, the study of optimization techniques to program parallel accelerators is paramount for two main objectives: first, improving the performance and energy efficiency of the platform, which are key metrics for both embedded and HPC systems; second, enforcing software engineering practices with the aim of guaranteeing code quality and reducing software costs. This thesis presents a set of techniques that have been studied and designed to achieve these objectives and to advance beyond the current state of the art. As a first contribution, we discuss the use of OpenMP tasking as a general-purpose programming model to support the execution of diverse workloads, and we introduce a set of runtime-level techniques to support fine-grain tasks on high-end many-core accelerators (devices with a power consumption greater than 10 W). Then we focus our attention on embedded computer vision (CV), with the aim of showing how to achieve the best performance by exploiting the characteristics of a specific application domain.
    To further reduce the power consumption of parallel accelerators beyond the current technological limits, we describe an approach based on the principles of approximate computing, which implies modification to the program semantics and proper hardware support at the architectural level.

    On the Suspension Design of Paquitop, a Novel Service Robot for Home Assistance Applications

    The general and constant ageing of the world population observed in the last decade has led the robotics research community to focus on answering the ever-growing demand for health care, housing, care-giving, and social security. Among others, researchers at Politecnico di Torino are developing a novel platform to address present-day issues, and to tackle many others that were not even taken into consideration before they were highlighted by the pandemic emergency currently in progress. This situation, in fact, made dramatically clear how important it is to have reliable non-human operators that one can trust when the life of elderly or weak patients is endangered by the simple presence of other people. The platform, named Paquitop, features an innovative architecture conceived for omni-directional planar motion. The machine is designed for domestic, unstructured, and variously populated environments. Therefore, the mobile robot should be able to avoid or pass over small obstacles, to carry out specific person-tracking tasks, and to operate with high dynamic performance. Given its purpose, this work addresses the design of the suspension system, which enables the platform to ensure steady floor contact and adequate stability in every operating condition. Different configurations of such a system are then presented and compared through use-case simulations.

    Enabling Mixed-Precision Quantized Neural Networks in Extreme-Edge Devices

    The deployment of Quantized Neural Networks (QNN) on advanced microcontrollers requires optimized software to exploit digital signal processing (DSP) extensions of modern instruction set architectures (ISA). As such, recent research proposed optimized libraries for QNNs (from 8-bit to 2-bit) such as CMSIS-NN and PULP-NN. This work presents an extension to the PULP-NN library targeting the acceleration of mixed-precision Deep Neural Networks, an emerging paradigm able to significantly shrink the memory footprint of deep neural networks with negligible accuracy loss. The library, composed of 27 kernels, one for each permutation of input feature maps, weights, and output feature maps precision (considering 8-bit, 4-bit and 2-bit), enables efficient inference of QNN on parallel ultra-low-power (PULP) clusters of RISC-V based processors, featuring the RV32IMCXpulpV2 ISA. The proposed solution, benchmarked on an 8-core GAP-8 PULP cluster, reaches peak performance of 16 MACs/cycle on 8 cores, performing 21x to 25x faster than an STM32H7 (powered by an ARM Cortex M7 processor) with 15x to 21x better energy efficiency. Comment: 4 pages, 6 figures, published in 17th ACM International Conference on Computing Frontiers (CF '20), May 11-13, 2020, Catania, Italy

    Decoupled motion planning of a mobile manipulator for precision agriculture

    Thanks to recent developments in service robotics technologies, precision agriculture (PA) is becoming an increasingly prominent research field, and several studies have outlined how the use of mobile robotic systems can help and improve farm production. In this paper, the integration of a custom-designed mobile base with a commercial robotic arm is presented, showing the functionality and features of the overall system for crop monitoring and sampling. To this aim, the motion planning problem is addressed by developing a tailored algorithm, based on the so-called manipulability index, that treats the base and robotic arm mobility as two independent degrees of motion, and by developing an open source closed-form inverse kinematics algorithm for the kinematically redundant manipulator. The presented methods and sub-systems, even though strictly related to a specific mobile manipulator system, can be adapted not only to PA applications where a mobile manipulator is involved but also to the wider field of assistive robotics.

    Enabling mixed-precision quantized neural networks in extreme-edge devices

    The deployment of Quantized Neural Networks (QNN) on advanced microcontrollers requires optimized software to exploit digital signal processing (DSP) extensions of modern instruction set architectures (ISA). As such, recent research proposed optimized libraries for QNNs (from 8-bit to 2-bit) such as CMSIS-NN and PULP-NN. This work presents an extension to the PULP-NN library targeting the acceleration of mixed-precision Deep Neural Networks, an emerging paradigm able to significantly shrink the memory footprint of deep neural networks with negligible accuracy loss. The library, composed of 27 kernels, one for each permutation of input feature maps, weights, and output feature maps precision (considering 8-bit, 4-bit and 2-bit), enables efficient inference of QNN on parallel ultra-low-power (PULP) clusters of RISC-V based processors, featuring the RV32IMCXpulpV2 ISA. The proposed solution, benchmarked on an 8-core GAP-8 PULP cluster, reaches peak performance of 16 MACs/cycle on 8 cores, performing 21× to 25× faster than an STM32H7 (powered by an ARM Cortex M7 processor) with 15× to 21× better energy efficiency.

    A novel missense mutation in PSEN2 gene associated with a clinical phenotype of frontotemporal dementia

    Background: In familial Alzheimer's disease, defects in three genes have been identified: the amyloid precursor protein (APP) gene on chromosome 21, the presenilin 1 (PSEN1) gene on chromosome 14 and the presenilin 2 (PSEN2) gene on chromosome 1. More than 160 pathogenic missense mutations have been described in PSEN1, with wide clinical phenotypic variability. In PSEN2 only 11 missense mutations are known, in two of which (M239V and T122R) the clinical phenotype may be frontotemporal dementia-like. Methods: We present a novel PSEN2 mutation (Y231C) in an Italian patient who seven years ago, at age 55, manifested mood and behavioural disorders characterized by apathy, delusions, physically aggressive behaviour and psychomotor agitation. Language disturbances appeared one year later and mild memory loss three years later. The neuropsychological pattern suggested a main dysfunction in the posterior temporal and parietal cortex. MRI showed diffuse atrophy, especially in posterior regions. Results: The genetic study showed an A-to-G mutation in exon seven of the PSEN2 gene, resulting in a tyrosine to cysteine substitution at residue 231. Conclusions: This new mutation confirms the variability of the phenotypes associated with PSEN2 mutations and justifies the analysis of this gene in behavioural disturbances associated with degenerative dementia, at least in Italy, where PSEN2 mutations seem more frequent than in other countries.

    Wheeled Mobile Robots: State of the Art Overview and Kinematic Comparison Among Three Omnidirectional Locomotion Strategies

    In the last decades, mobile robotics has become a very interesting research topic in the field of robotics, mainly because of population ageing and the recent pandemic emergency caused by Covid-19. Against this background, the paper presents an overview of wheeled mobile robots (WMRs), which have a central role in today's scenario. In particular, the paper describes the most commonly adopted locomotion strategies, perception systems, control architectures and navigation approaches. After analyzing the state of the art, the paper focuses on the kinematics of three omnidirectional platforms: a four mecanum wheels robot (4WD), a three omni wheel platform (3WD) and a two swerve-drive system (2SWD). Through a dimensionless approach, these three platforms are compared to understand how their mobility is affected by the wheel speed limitations that are present in every practical application. This original comparison has not previously been presented in the literature, and it can be used to improve our understanding of the kinematics of these mobile robots and to guide the selection of the most appropriate locomotion system according to the specific application.